Abstract

In  current  state  of  the  art  statistical  MT  systems,  word  choice  in  the  target  language  is  governed 
implicitly by a combination of “phrase” selection and language modeling. In contrast, the state of
the  art  in  word  sense  disambiguation  takes  advantage  of  a  wide  array  of  features,  both  locally  and 
at  the  document  level.  This  technical  report  describes  our  initial  efforts  to  employ  the  power  of 
WSD  techniques  in  helping  to  guide  a  state  of  the  art  statistical  MT  system  toward  better  word 
choices. 

We briefly discuss the principles underlying our approach as contrasted with another recent attempt
to integrate WSD with statistical MT (Carpuat and Wu, 2005) that yielded negative results.
We  then  describe  our  approach,  which  leads  to  a  small  improvement  in  translation  performance 
over  a  state  of  the  art  phrase-based  statistical  MT  system.  Qualitative  analysis  of  translation 
output  suggests  there  are  still  significant  opportunities  to  improve  performance  further. 
1 Introduction


Statistical machine translation is widely viewed as involving two central problems: the problem of
selecting the words in the target language, and the problem of obtaining appropriate word order. In
the IBM models of the early 1990s (Brown et al., 1990), selection of target-language lexical items
was governed by a combination of two main components: a table of lexical translation probabilities
Pr(f_j|e_i) for words f_j and e_i, and the language model, determining target-language probability
Pr(e_1 ... e_n), typically using an N-gram model. Context played a role in lexical disambiguation
primarily monolingually, with the language model biasing the decoder toward selecting a word e_i
congruent with the previous N - 1 words in the target-language hypothesis. Bilingual influences
on lexical selection did not directly involve any context external to the word being translated,
although summing over alternative alignments did allow the translation probabilities for one word
pair to influence the selection of a target word for a different source word elsewhere in the
sentence, via an indirect sort of “triggering” effect. Since
translation was done strictly on a sentence-by-sentence basis, document-level context played no
role at all in influencing word choice.

Since  the  late  1990s,  phrase-based  statistical  translation  models  have  represented  the  state 
of the art in statistical MT (e.g., Och and Ney, 2004; Koehn et al., 2003; Marcu and Wong,
2002; Kumar et al., 2005). A phrase table relates contiguous word sequences f and e, capturing
both local reorderings and local constraints on word choice; for example, Spanish-English phrase
correspondences  would  assign  high  probability  to  the  relationship  between  a  escala  mundial  and 
on a global scale. In addition to the adjective-noun reordering, the phrase table captures the fact
that,  in  the  context  of  translating  this  Spanish  phrase,  scale  is  a  more  typical  English  translation 
choice  compared  to  closely  related  words  such  as,  say,  magnitude.  This  choice  is,  of  course,  likely 
to  be  reinforced  monolingually  by  the  language  model,  since  the  probability  of  scale  given  on  a 
global...  is  high.1 

Statistical  phrases  are  a  positive  step  in  lexical  selection  for  statistical  MT,  in  that  they  help 
take  better  advantage  of  local  context  —  long  known  to  be  an  influential  factor  in  determining 
word  meaning  in  context  (Yarowsky,  1993).  However,  research  on  the  determination  of  word 
meaning  in  context  has  been  converging  on  the  idea  that  there  are  actually  a  whole  variety 
of features that can play a role. In the 2004 Senseval-3 exercise (Mihalcea and Edmonds,
2004), word sense disambiguation systems took advantage not only of string-local features, but
also of local part-of-speech information, sentence-level grammatical collocates, and less local or
document-level features such as document topic codes, co-occurring words and N-grams, and so
on. Consistent with Yarowsky and Florian’s (2002) observations that no single classifier is a
one-size-fits-all solution, the best performing systems took advantage of feature variety and classifier
combination  approaches.  This  naturally  raises  the  question  of  whether  the  same  techniques  could 
have  advantages  in  lexical  selection  for  statistical  MT,  which  bears  a  very  close  relationship  to 
monolingual  WSD  (e.g.  see  (Dagan  and  Itai,  1994;  Resnik  and  Yarowsky,  1999;  Diab  and  Resnik, 
2002)). 

We  are  undertaking  to  explore  this  question,  and  this  technical  report  presents  a  picture  of 
where  the  research  currently  stands.  Our  research  has  been  guided  thus  far  by  several  general 
considerations. 

1 See Edmonds and Hirst (2002) for a knowledge-based approach to the closely related problem of selecting
among near-synonyms in natural language generation.
• First, it is clear from the Senseval exercises that the most successful WSD techniques take
advantage  of  supervised  training,  and  the  more  data  the  better. 

• Second, our experience in MT suggests caution when attempting to exploit data and knowledge
resources external to the bilingual training materials, such as separate WSD-training
corpora or previously defined sense inventories. As a case in point, among systems represented
in the NIST MT evaluations, the only statistical MT system to exploit hierarchical
syntax  successfully  —  the  University  of  Maryland’s  Hiero  system  (Chiang,  2005)  —  learns 
its  synchronous  context-free  rules  directly  from  the  training  bitext,  rather  than  trusting  a 
parser  trained  on  an  external  corpus. 

•  Third,  and  related,  experience  in  statistical  MT  suggests  that  one  should  be  cautious  about 
turning  uncertain  decisions  into  hard  constraints.  Trusting  (or,  more  to  the  point,  not  fully 
trusting) the one-best output of a source-language parser is one example. Another is Franz
Och’s  use  of  rule-based  translation  components,  e.g.  for  dates,  numbers,  bylines,  etc.  These 
are  not  integrated  into  his  system  as  hard  translation  choices,  but  rather  as  dynamically 
generated  phrase  table  entries  that  can  be  weighed  by  the  decoder  in  the  context  of  the 
entire  search.2 

We believe these three considerations are most likely responsible for the negative (but nonetheless
quite interesting) results recently reported by Carpuat and Wu (2005). In their (admittedly
first-pass)  attempt  to  integrate  state  of  the  art  WSD  with  a  Chinese-English  MT  system,  they  used 
a  WSD  system  trained  on  a  relatively  small  dataset  (about  37  training  instances  per  target  word), 
their training dataset and sense inventory were unrelated to the bilingual training data, and
they  integrated  WSD  output  via  hard  constraints  (either  forcing  a  choice  among  WSD-derived 
candidates  at  decoding  time,  or  replacing  target  words  in  postprocessing). 

In  the  work  reported  here,  we  are  using  target  vocabulary  items  directly  as  “senses”,  thus 
bypassing entirely the question of an externally defined sense inventory. To the extent that word-level
alignments are accurate (a non-trivial question, of course), aligned bitext can provide large
quantities  of  material  to  train  from  —  for  example,  a  sentence  pair  containing  escala  aligned  with 
scale  provides  a  training  instance  for  the  former  word  “tagged”  as  the  latter.  As  discussed  below, 
we  integrate  WSD  choices  as  soft  decisions  by  taking  advantage  of  a  phrase-based  statistical  MT 
system (Koehn, 2004) that optionally permits the specification of confidence-weighted translation
alternatives  in  the  source-language  input,  giving  the  decoder  the  choice  of  whether  to  use  the 
specified  translations  or  those  suggested  by  its  translation  model. 

In  Section  2,  we  briefly  describe  our  WSD  framework,  and  how  it  has  been  adapted  for  lexical 
selection  in  a  phrase-based  statistical  MT  system.  In  Section  3,  we  describe  our  preliminary 
experiments,  including  quantitative  evaluation,  qualitative  analysis,  and  thoughts  on  additional 
improvements.  Section  4  summarizes  and  concludes. 

2  Using  WSD  for  Lexical  Selection 

Beyond the general considerations outlined in the Introduction, our priority in using WSD techniques
for lexical selection is a flexible infrastructure that permits an active cycle of experimentation,
data analysis, and algorithm refinement. For that reason, we use the UMD-SST supervised
WSD system (Cabezas et al., 2001; Cabezas et al., 2004), which is based on a support vector
machine classifier and supports a wide variety of local and less local contextual features.

2 Source: Presentation at NIST Machine Translation Workshop, June 20-21, 2005.

2.1  WSD  Infrastructure 

The UMD-SST WSD system is described in detail in our Senseval workshop publications (Cabezas
et al., 2001; Cabezas et al., 2004). Briefly, the system follows the classic supervised learning
paradigm, using a single SVM classifier; each word in the vocabulary is considered an independent
classification problem. First, annotated training instances for the ambiguous word are
analyzed  so  that  each  instance  can  be  represented  as  a  collection  of  feature-value  pairs  labeled 
with the correct category. Then, these data are used for parameter estimation within the supervised
learning framework in order to produce a trained classifier. Finally, the trained classifier is
given  previously  unseen  test  instances  and  for  each  instance  it  yields  a  confidence  score  for  each 
of  the  possible  category  labels. 

Contextual  features  available  in  the  current  system  include  local  collocational  features  within 
a  window  of  plus-or-minus  3  words,  grammatical  collocations  within  the  sentence,  and  unigrams 
found within a given extra-sentential context. Features are weighted using inverse category frequency
(ICF), which is, by analogy with inverse document frequency (IDF), a function of how
many distinct categories a feature appears with in training data. Features that occur with most
senses of a word have low ICF; those more heavily skewed toward fewer senses have high ICF. In
disambiguating a word w with senses S = {s_1, s_2, ..., s_{N_w}}, we define ICF_w(f) = -log(N_f / N_w),
where N_f is the number of distinct elements of S that ever co-occur with feature f in the training
data for word w. For example, if a word has five senses, and the feature L1:the appears in
some training instance for each of the five senses, then ICF_w(L1:the) = -log(5/5) = 0, correctly
indicating that this feature is not at all useful for disambiguating among the five senses of this
word.
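
The ICF definition above can be illustrated with a minimal sketch. The function name, the data layout (pairs of a sense label and a feature set), and the toy instances are assumptions made for illustration only; they are not taken from the UMD-SST implementation.

```python
import math
from collections import defaultdict


def icf_weights(instances, num_senses=None):
    """Compute ICF_w(f) = -log(N_f / N_w) for every feature seen with one word w.

    instances: iterable of (sense_label, set_of_features) training pairs (assumed layout).
    num_senses: N_w; defaults to the number of distinct sense labels observed.
    """
    senses_per_feature = defaultdict(set)
    all_senses = set()
    for sense, features in instances:
        all_senses.add(sense)
        for feat in features:
            senses_per_feature[feat].add(sense)
    n_w = num_senses if num_senses is not None else len(all_senses)
    # log(N_w / N_f) equals -log(N_f / N_w) and avoids returning -0.0
    return {feat: math.log(n_w / len(senses))
            for feat, senses in senses_per_feature.items()}


# Hypothetical toy data: two senses, one feature shared by both.
toy = [("role", {"L1:el", "R1:central"}), ("paper", {"L1:el", "R1:blanco"})]
weights = icf_weights(toy)
```

In this toy case the shared feature L1:el receives weight 0, while each feature seen with only one of the two senses receives log 2.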

2.2  Adapting  WSD  Techniques  for  Lexical  Selection 

Adapting  UMD-SST  for  lexical  selection  in  MT  involves  a  straightforward  recasting  of  aligned 
target-language  words  as  sense  tags.  A  bilingual  corpus  is  aligned  using  standard  off-the-shelf 
tools  (GIZA++),  using  English  as  the  target  language.  The  set  of  “sense  tags”  for  a  word  is  the 
set  of  English  words  with  which  it  is  aligned,  possibly  filtered  (see  Section  3  for  details). 
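
As a rough sketch of how a sense inventory could be collected from alignment links, assuming the links have already been read from the aligner's output into (source word, target word) pairs: the function name, the optional filter argument, and the toy links below are hypothetical, and parsing of GIZA++ output is not shown.

```python
from collections import Counter, defaultdict


def sense_inventory(aligned_pairs, allowed=None):
    """Map each source word to a Counter of the target words it is aligned with.

    aligned_pairs: iterable of (source_word, target_word) alignment links.
    allowed: optional set of (source_word, target_word) pairs used as a filter;
             None keeps every link.
    """
    inventory = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        if allowed is None or (src, tgt) in allowed:
            inventory[src][tgt] += 1
    return inventory


# Hypothetical links; the last one stands in for a noisy alignment.
links = [("papel", "role"), ("papel", "paper"), ("papel", "role"), ("papel", "the")]
inv = sense_inventory(links, allowed={("papel", "role"), ("papel", "paper")})
```

Each surviving link doubles as a training instance: the source word in its context, labeled with the aligned English word.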

In  the  current  adaptation  of  the  system  for  lexical  selection,  local  collocation  features  are 
defined using all words within a three-word window of the target word, and wide-context features
are  defined  using  all  words  within  the  current,  previous,  and  following  sentence.  Grammatical 
features  are  not  used.  Consider  the  following  example  as  a  source  of  training  items. 

F.  estoy  de  acuerdo  con  el  en  cuanto  al  papel  central  que  debe  conservar  en  el  futuro  la  comision 
como  garante  del  interes  general  comunitario 

E.  i  agree  with  him  that  the  commission  must  continue  to  play  a  pivotal  role  as  guardian  of 
the  common  interests  of  the  community 
In this sentence pair, consider the Spanish word papel, aligned with role. In training a classifier
for  disambiguating  the  word  papel,  the  sense  label  would  be  the  English  word  role,  and  features 
from  context  would  include  the  following: 

• Local collocates: {L3:en, L2:cuanto, L1:al, R1:central, R2:que, R3:debe}

• Wide-context features: Every token in this sentence and the previous and following sentences.
These include estoy, de, acuerdo, etc.
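
The two feature types listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the dictionary layout of a training instance are hypothetical, and tokenization is taken to be simple whitespace splitting.

```python
def local_collocates(tokens, index, window=3):
    """Position-tagged words within `window` tokens of the target (L3..L1, R1..R3)."""
    feats = []
    for offset in range(window, 0, -1):        # L3, L2, L1
        if index - offset >= 0:
            feats.append(f"L{offset}:{tokens[index - offset]}")
    for offset in range(1, window + 1):        # R1, R2, R3
        if index + offset < len(tokens):
            feats.append(f"R{offset}:{tokens[index + offset]}")
    return feats


def wide_context(sentences, sent_index):
    """All tokens in the current, previous, and following sentence."""
    lo = max(0, sent_index - 1)
    hi = min(len(sentences), sent_index + 2)
    return {tok for sent in sentences[lo:hi] for tok in sent}


spanish = ("estoy de acuerdo con el en cuanto al papel central que debe "
           "conservar en el futuro la comision como garante del interes "
           "general comunitario").split()
target = spanish.index("papel")
instance = {
    "label": "role",                            # the aligned English word
    "local": local_collocates(spanish, target),
    "wide": wide_context([spanish], 0),         # neighbouring sentences omitted here
}
```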

This  example  illustrates  an  essentially  homonymic  distinction:  the  word  papel  has  frequent 
translations either as role or as paper. The sentence also contains a nice example of a finer-grained
distinction: central is here aligned with pivotal, though it is also frequently translated as
central, and sometimes key, decisive, etc.


2.3  Integrating  WSD  into  MT  Decoding 

The  baseline  MT  system  in  our  experiments  was  Pharaoh  (Koehn,  2004).  In  addition  to  being 
representative  of  current  phrase-based  statistical  MT  approaches,  and  therefore  a  proper  baseline 
for comparison, Pharaoh makes it possible to investigate the impact of alternative lexical selection
decisions  while  keeping  the  rest  of  the  translation  framework  constant.  In  particular,  the  Pharaoh 
decoder allows the option of including, within the source sentence, XML markup indicating translation
possibilities for any given span of words in the input. This can be useful for hard rewrites,
e.g. forcing European number formats like 3,14159 to be rendered as American 3.14159. More to
the  point,  XML  markup  can  be  used  to  provide  soft  alternatives,  which  the  decoder  will  consider 
along  with  the  alternatives  posed  by  the  translation  model,  the  final  determination  being  made 
by  the  language  model. 

As  an  example,  consider  the  following  Spanish  input: 


sin embargo , señor presidente también es realmente necesario que en biarritz se vaya un poco
más lejos...

After WSD has been applied, the input to the decoder might be:


<n english="without|even|no" prob="...">sin</n>
<n english="but|embargo|yet" prob="...">embargo</n>
<n english="sir|gentleman|mister" prob="...">señor</n>
<n english="president|chair|speaker" prob="...">presidente</n>
<n english="also|too|even" prob="...">también</n>
<n english="is|it|be" prob="...">es</n>
<n english="really|indeed|actually" prob="...">realmente</n>
<n english="need|necessary|must" prob="...">necesario</n>
<n english="that|to|than" prob="...">que</n>
<n english="in|on|and" prob="...">en</n>
biarritz
<n english="be|is|been" prob="...">se</n>
<n english="go|goes|going" prob="...">vaya</n>
un
<n english="little|bit|some" prob="...">poco</n>
<n english="more|over|further" prob="...">más</n>
<n english="far|away|afar" prob="...">lejos</n>



For the sake of readability, “...” appears here in lieu of probability distributions, and the sentence
has been broken across multiple lines.
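
Markup of this shape could be generated along the following lines. This is only a sketch: the function names are hypothetical, the confidence values in the usage lines are invented, and the pipe-separated format of the prob attribute is an assumption that simply mirrors the english attribute shown in the example.

```python
from xml.sax.saxutils import escape, quoteattr


def markup_token(word, candidates):
    """Wrap one source token in <n> markup, or leave it bare if there are no candidates.

    candidates: list of (english_word, confidence) pairs; may be empty.
    """
    if not candidates:
        return escape(word)
    english = "|".join(e for e, _ in candidates)
    # Assumption: confidences are pipe-separated in the same order as the words.
    prob = "|".join(f"{p:.2f}" for _, p in candidates)
    return f"<n english={quoteattr(english)} prob={quoteattr(prob)}>{escape(word)}</n>"


def markup_sentence(tokens, per_token_candidates):
    """Build decoder input from tokens and a parallel list of candidate lists."""
    return " ".join(markup_token(t, c) for t, c in zip(tokens, per_token_candidates))


# Invented confidences, for illustration only.
line = markup_sentence(
    ["en", "biarritz"],
    [[("in", 0.6), ("on", 0.3), ("and", 0.1)], []],
)
```

Tokens for which no distribution is available, such as biarritz in the example, pass through unmarked and are handled by the translation model alone.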

Crucially, the decoder is free to override preferences expressed in the XML markup, e.g. translating
the phrase sin embargo as “nevertheless” rather than being forced into something more
awkward like “even yet”.3 At the same time, the choice between “Mr. Speaker” and “Mr. President”
might be one that is undetermined by sentence-level context, but made clear in the context
of  the  entire  document,  and  thus  amenable  to  a  nudge  in  the  right  direction  from  WSD  techniques 
that  take  advantage  of  document-level  context. 

3  Preliminary  Experimentation 

Our preliminary experiments have been conducted using the Spanish-English Europarl corpus (Koehn,
2003),  randomly  sampling  70000  word-aligned  sentences  for  training,  2000  for  development,  and 
2000  for  testing.  Classifiers  are  constructed  for  all  Spanish  words  —  not  lemmas,  and  not  just 
content  words.  The  set  of  possible  English  “senses”  for  a  word  is  the  set  of  English  words  with 
which  it  is  ever  aligned  in  the  training  data,  filtered  by  checking  a  hybrid  manual-statistical 
dictionary.4 
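
Footnote 4 describes the statistical dictionary entries as the aligned word pairs with the highest log-likelihood ratio. One common formulation is Dunning's G² statistic over a 2x2 contingency table of alignment links; the sketch below assumes that formulation, which the report does not specify, and uses hypothetical function names and toy data.

```python
import math
from collections import Counter


def _term(observed, expected):
    # Contribution of one cell; an empty cell contributes nothing.
    return observed * math.log(observed / expected) if observed > 0 else 0.0


def llr(k11, k12, k21, k22):
    """G^2 for a 2x2 table: k11 = links (e, f), k12 = (e, other f),
    k21 = (other e, f), k22 = neither. Assumes at least one link in total."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    return 2.0 * sum(_term(k, rows[r] * cols[c] / n) for k, r, c in cells)


def top_entries(links, k):
    """Score every distinct (e, f) pair in a list of links and keep the k best."""
    pair = Counter(links)
    e_tot = Counter(e for e, _ in links)
    f_tot = Counter(f for _, f in links)
    n = len(links)
    scored = []
    for (e, f), k11 in pair.items():
        k12 = e_tot[e] - k11
        k21 = f_tot[f] - k11
        k22 = n - k11 - k12 - k21
        scored.append((llr(k11, k12, k21, k22), e, f))
    scored.sort(reverse=True)
    return [(e, f) for _, e, f in scored[:k]]


# Hypothetical links: "the" is aligned indiscriminately, so it scores low.
toy_links = ([("role", "papel")] * 10 + [("house", "casa")] * 10
             + [("the", "papel")] * 2 + [("the", "casa")] * 2)
best = top_entries(toy_links, 2)
```

Note that G² measures strength of association in either direction, so a practical filter may also want to check that the pair co-occurs more often than expected.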

3.1  Quantitative  Results 

We used BLEU r1n4 (MTeval version 11a), that is, a single reference translation (r1) and matching
up to 4-grams (n4), to compare the Pharaoh baseline against Pharaoh with WSD-based
lexical  selection  recommendations.  The  reference  translation  was  simply  the  English  translation 
for  the  Spanish  test  item  in  the  Europarl  test  set.  The  decoder  output  differed  by  at  least  one 
token  for  56%  of  the  items  in  the  test  set.  Including  WSD-based  lexical  selection  provides  a  BLEU 
score  of  0.2382  as  compared  to  the  baseline  of  0.2356,  a  1.1%  relative  difference. 
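
For reference, the relative difference quoted above follows directly from the two BLEU scores; the helper name below is just for illustration.

```python
def relative_difference(new, baseline):
    """Improvement of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


rel = relative_difference(0.2382, 0.2356)   # 0.0026 / 0.2356
formatted = f"{rel:.1%}"
```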

3.2  Qualitative  Discussion 

Although  the  improvement  in  BLEU  score  is  small,  and  most  likely  not  statistically  significant,  it  is 
an  improvement  rather  than  a  decrease  in  performance  (cf.  (Carpuat  and  Wu,  2005)).  Moreover, 
looking  at  the  experimental  context,  and  considering  the  results  qualitatively,  there  are  some 
reasons  to  be  cautiously  optimistic  about  the  possibility  of  improving  the  results. 

3 This is accomplished by running Pharaoh with the -bypass flag.

4 We are grateful to Nizar Habash for providing the manual portion of the dictionary. Statistically derived entries
were obtained by computing the log-likelihood ratio for aligned word pairs (e, f) in the training data, sorting, and
keeping the 100K entries for which the log-likelihood ratio was highest.

First, BLEU with a single reference is very strict, since it requires an exact match between
tokens in the MT output and tokens in the reference translation. The decoder using WSD-based
lexical selection appears to be making some changes that should be considered improvements, but
which are not counted under this strict criterion. For example, consider:


SRC. se sabe por ejemplo que en francia la cifra de ingresos fiscales varia en funcion de que se
tomen las estadisticas de la direcion general de contabilidad publica o las de la contabilidad
nacional.

REF.  it  is  known  that  in  france  ,  for  instance  ,  the  figure  for  tax  receipts  varies  according  to 
whether  you  use  the  statistics  of  the  direction  generale  de  la  comptabilite  publique  or  those 
of  the  comptabilite  nationale  . 

PHA.  is  in  france  ,  for  example  ,  the  number  of  tax  revenue  varies  according  to  take  the  statistics 
of  the  directorate  general  of  the  public  accounts  or  of  the  national  accounts  . 

WSD.  for  example  ,  we  know  that  the  figure  in  france  of  income  tax  varies  according  to  take  the 
statistics  of  the  directorate  general  of  the  public  accounts  or  of  the  national  accounts  . 

In this item, the WSD prediction suggests that sabe should be translated as know, which presumably
helps guide the decoder toward translating se sabe as we know; this is a perfectly
reasonable  translation,  even  though  the  reference  uses  it  is  known.  This  example  also  illustrates 
the  correct  choice  of  figure  rather  than  number. 

A sampling of other cases where WSD-enabled lexical selection improves on Pharaoh by making
choices that are reasonable but do not match the reference includes alleviate the burden (versus
relieve the burden in the reference translation), the duty to remember (versus the duty of
remembrance), reflect seriously about (versus reflect seriously on), and a complete success
(versus a triumph).

Second,  WSD-based  guidance  on  lexical  choices  affects  sentence-level  translations  more  globally, 
not  just  at  the  level  of  individual  words.  Consider: 

SRC. senor presidente , he votado a favor de esta carta en buena parte por la influencia que nuestro
colega ingo friedrich y el profesor herzog han ejercido en su contenido .

REF.  mr  president  ,  i  voted  in  favour  of  this  charter  ,  not  least  because  of  the  influence  which  our 
colleague  ,  ingo  friedrich  ,  and  professor  herzog  have  exerted  on  its  content  . 

BAS. i voted for this in a letter to the influence mr ingo friedrich and professor herzog have exercised
their content .

NEW.  mr  president  ,  voted  in  favour  of  the  charter  in  large  part  by  the  influence  mr  ingo  friedrich 
and  professor  herzog  have  exercised  their  content  . 

The  baseline  decoder  chooses  to  translate  carta  as  letter  (or  perhaps  even  in  a  letter),  which  leads 
to  a  fragment  of  the  translation,  i  voted  for  this  in  a  letter,  that  is  perfectly  fluent  but  utterly 
incorrect. In contrast, by translating carta correctly as charter, the decoder enabled with WSD-based
lexical selection not only gets that word correct, but also creates a main verb phrase that
more accurately preserves the meaning of the source, voted in favour of the charter. Similarly,
better  translation  of  function  words  sometimes  has  quite  a  large  effect  on  the  meaning.  Consider 
the  distinction  between  amendments  to  and  amendments  on. 
SRC. estamos en contra de las enmiendas sobre la masacre de los armenios precisamente por esa
misma  razon  . 

REF.  we  opposed  the  amendments  on  the  armenian  massacre  for  exactly  this  reason  . 

BAS.  we  are  against  the  amendments  to  the  massacre  of  armenians  by  the  exactly  the  same  reason 

NEW.  we  are  against  the  amendments  on  the  massacre  of  armenians  by  the  exactly  the  same  reason 


Third,  the  current  version  of  the  system  is  entirely  naive  about  the  grammatical  category  of 
the  word  being  translated,  except  insofar  as  local  collocational  features  provide  stronger  evidence 
for  translations  in  one  category  versus  another.  In  some  cases  where  WSD-based  lexical  selection 
makes  incorrect  choices,  conditioning  on  grammatical  category  might  provide  a  better  distribution 
over translations. As an example, the WSD-based preference below for seguros as sure, rather
than insurance, leads the decoder to decrease rather than increase the accuracy of the translation.

SRC. en efecto , si cada vez mas europeos acuden a los seguros complementarios para ser reembolsados ,
el sector mutualista sigue siendo la mejor garantia para la igualdad de acceso a la asistencia
sanitaria .

REF.  if  more  and  more  europeans  turn  to  supplementary  health  insurance  in  order  to  reimburse 
health  care  costs  ,  the  mutualist  sector  will  remain  the  best  guarantee  for  equal  access  to 
care  . 

BAS.  if  increasingly  come  to  the  european  supplementary  insurance  to  be  reimbursed  ,  the  sector 
mutualista  remains  the  best  guarantee  for  the  equal  access  to  health  care  . 

NEW.  in  fact  ,  if  increasingly  come  to  the  european  complementary  sure  to  be  reimbursed  ,  the 
sector  mutualista  remains  the  best  guarantee  for  the  equal  access  to  health  care  . 

Biasing  the  translation  in  favor  of  a  noun  interpretation  of  seguros  might  well  lead  the  WSD-based 
selection  to  the  correct  conclusion,  and  consideration  of  a  variety  of  examples,  like  those  shown 
above, suggests that introducing a bias based on part-of-speech would not hurt in other cases
where  WSD  is  already  going  in  the  right  direction. 

In addition, we suspect that with use of wider context (for example, features from the entire
document rather than the three-sentence window) there would be more topical evidence for a
more  specific  meaning  like  insurance  rather  than  a  lexical  choice  like  sure  that  is  more  generic 
and  a  priori  more  likely. 

Fourth,  it  is  worth  noting  that  the  current  experiment  applied  WSD-based  lexical  guidance 
across  the  board,  in  all  cases  where  a  distribution  could  be  obtained.  But  in  many  cases,  sense 
distributions  are  so  skewed  that  it  is  better  to  simply  use  the  predominant  sense  or  the  sense 
already  favored  by  the  decoder,  changing  this  default  only  when  there  is  strong  evidence  in  favor 
of  doing  so.  (This  is  related  to  one  of  the  reasons  WSD  has  had  very  limited  success  in  monolingual 
information  retrieval;  see  Resnik  (forthcoming)  for  discussion  of  relevant  literature.)  Taking  this 
observation into account suggests using confidence assessment techniques and providing WSD-based
lexical selection bias only when one can be confident in the choice. One way to do this
would be to add guidance only in cases having a high value for Pr(WSD > PHA | e, f, context),
where WSD > PHA indicates that WSD-guided lexical selection is correct when Pharaoh’s choice
is incorrect.
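
A minimal sketch of such gating, assuming some separate estimator already supplies a per-token value for Pr(WSD > PHA | e, f, context): the function name, the parallel-list layout, and the threshold of 0.8 are all hypothetical, and how the probability itself would be estimated is left open here, as it is in the text.

```python
def gate_guidance(per_token_candidates, per_token_confidence, threshold=0.8):
    """Keep WSD candidates only where the confidence estimate reaches the threshold.

    per_token_candidates: list of candidate lists, one per source token.
    per_token_confidence: parallel list of estimates of Pr(WSD > PHA | e, f, context).
    An empty list means no guidance is offered for that token, so the decoder
    falls back on its own translation model.
    """
    return [cands if conf >= threshold else []
            for cands, conf in zip(per_token_candidates, per_token_confidence)]


# Invented numbers: guidance for the first token is dropped, the second is kept.
gated = gate_guidance(
    [[("sure", 0.6), ("insurance", 0.4)], [("charter", 0.9), ("letter", 0.1)]],
    [0.30, 0.95],
)
```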

Fifth  and  finally,  lexical  selection  in  Spanish  may  be  easier  than  in  other  languages  for  a 
phrase-based  MT  system  accomplishing  lexical  selection  using  just  its  phrase  table  and  language 
model.  For  a  language  like  Chinese,  where  there  are  likely  to  be  more  significant  word  order  and 
grammatical  category  divergences  with  English,  a  larger  arsenal  of  WSD  techniques  may  turn  out 
to  have  greater  advantages  over  local  context  alone.  Working  with  a  more  heterogeneous  corpus 
than  Europarl  might  have  a  similar  effect. 

4  Conclusions 

In  this  technical  report  we  have  proposed,  for  the  first  time,  an  integration  of  WSD  techniques 
with  statistical  phrase-based  translation  by  treating  target-language  lexical  items  as  “senses”. 
Doing  so  enables  us  to  take  advantage  of  existing  WSD  systems  by  using  large  aligned  bitexts  as 
a  source  of  training  data  for  supervised  approaches,  and  although  these  data  are  noisy,  all  manner 
of  sample  selection  techniques  are  therefore  available  as  ways  to  improve  training  data  quality. 

Work  still  needs  to  be  done  in  order  to  obtain  real  benefits  from  applying  WSD  techniques  in 
MT.  But  our  small  positive  (or  at  least  not  negative)  result  is  reassuring,  particularly  since  our 
baseline system is stronger than the baseline statistical MT system used in Carpuat and Wu’s
(2005)  experiment.  We  hope  to  gain  from  the  insights  in  their  careful  analysis  of  negative  results, 
and  in  the  near  future  we  would  like  to  conduct  experiments  with  Chinese  in  order  to  obtain  a 
direct  comparison  of  approaches  that  are  and  are  not  mediated  by  a  Chinese  word  sense  inventory. 